Adaptive Chinese Word Segmentation

نویسندگان

  • Jianfeng Gao
  • Andi Wu
  • Cheng-Ning Huang
  • Hongqiao Li
  • Xinsong Xia
  • Hauwei Qin
چکیده

This paper presents a Chinese word segmentation system which can adapt to different domains and standards. We first present a statistical framework where domain-specific words are identified in a unified approach to word segmentation based on linear models. We explore several features and describe how to create training data by sampling. We then describe a transformation-based learning method used to adapt our system to different word segmentation standards. Evaluation of the proposed system on five test sets with different standards shows that the system achieves stateof-the-art performance on all of them.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Adaptive Chinese Word Segmentation with Online Passive-Aggressive Algorithm

In this paper, we describe our system1 for CIPS-SIGHAN-2010 bake-off task of Chinese word segmentation, which focused on the cross-domain performance of Chinese word segmentation algorithms. We use the online passive-aggressive algorithm with domain invariant information for cross-domain Chinese word segmentation.

متن کامل

Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach

This paper presents a pragmatic approach to Chinese word segmentation. It differentiates from most of the previous approaches mainly in three respects. First of all, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in rea...

متن کامل

Human-Computer Interactive Chinese Word Segmentation: An Adaptive Dirichlet Process Mixture Model Approach

Previous research shows that Kalman filter based human-computer interactive Chinese word segmentation achieves an encouraging effect in reducing user interventions, but suffers from the drawback of incompetence in distinguishing segmentation ambiguities. This paper proposes a novel approach to handle this problem by using an adaptive Dirichlet process mixture model. By adjusting the hyperparame...

متن کامل

Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection

We present a joint model for Chinese word segmentation and new word detection. We present high dimensional new features, including word-based features and enriched edge (label-transition) features, for the joint modeling. As we know, training a word segmentation system on large-scale datasets is already costly. In our case, adding high dimensional new features will further slow down the trainin...

متن کامل

Statistical Models for Word Segmentation And Unknown Word Resolution

In a Chinese sentence, there are no word delimiters, like blanks, between the “words”. Therefore, it is important to identify the word boundaries before processing Chinese text. Traditional approaches tend to use dictionary lookup, morphological rules and heuristics to identify the word boundaries. Such approaches may not be applied to a large system due to the complicated linguistic phenomena ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004